# `lsup_repo`

**This software is work in progress.**

`lsup_repo` is a C and Python library providing embedded (server-less)
data repository services. It builds upon a foundational library, [`lsup_rdf`](
https://notabug.org/scossu/lsup_rdf), which handles RDF and graph data.

## Background & scope

Lakesuperior was initially built upon the Fedora repository software. This new
version is a complete reengineering and repurposing of the previous software,
seeking to provide similar repository services with a simplified set of
concepts and constraints, focused on usability and simplicity of design. Unlike
Fedora, it does not aim to adhere to, alter, or set any API standard.
Long-term sustainability of the handled data is provided by transparent exports
into commonly parsable data formats (RDF) and tools to rebuild a repository
from data files.

That said, nothing would prevent someone from adding support for LDP, OCFL,
Memento, etc., or even creating a complete Fedora implementation, using this
library as a foundation for the basic repository functionality.

## Use

`lsup_repo` can be included in a C or Python program to manage the life cycle
of RDF and non-RDF data. It can store and manage documents of any format and
size, and catalog them via RDF metadata.

`lsup_repo` does not need to run a server for its core functionality. The
interaction with the library is done via a C or Python API. That said, a REST
API or any other type of server can be built with relative ease on top of this
library. A separate project, based on this library, may in the future provide a
REST API for generic resource management, likely based on some existing
standard.

The current goal of this development is to build a minimum-viable product (MVP)
to replace the essential functionality of a previous project,
[Lakesuperior](https://notabug.org/scossu/lakesuperior).

## Status

Pre-alpha. Currently at the beginning of the implementation phase. The
structure of the code may change radically. The features listed below should be
understood as goals.

## Features

### Short Term (MVP)

- Handle the life cycle of arbitrary data documents (local disk storage)
- Handle the life cycle of RDF resources describing such documents
- Basic organizational structures: sets, lists, proxies (see "Concepts" below)
- Versioning: create a version, revert to a version, "soft" deletion and
  reinstatement of a deleted resource
- Checksum: cryptographic checksum of data, fixity check of resources in
  transit and at rest on demand
- Basic management utilities: integrity check, statistics
- Serialization and de-serialization to/from Turtle, TriG, N-Triples, N-Quads
- Python bindings

### Mid Term

- Dump, restore and migration utilities
- Notification stream
- Multiple back end options (local, S3, other network protocols)
- Authentication and access policies

### Long Term

- Checksum of RDF resources (depends on `lsup_rdf` development)
- Other features as they become necessary

## Concepts

### Data resources

At the center of Lakesuperior is the goal of storing and organizing arbitrary
files that can be found on a hard drive, remote server, etc. These files are
called *Data Resources* (DATA-R). Their contents are entirely opaque to
Lakesuperior, therefore any type of document can be handled.

### Descriptive Resources

Each data resource is accompanied by a *Descriptive Resource* (DESC-R). In the
first iteration of Lakesuperior this is an RDF named graph which at a minimum
contains a pointer to the data location and basic technical metadata. The URI
of the named graph is globally unique. This resource stands for the non-RDF
resource in a Linked Data context. User-defined metadata can also be added to
it.

Descriptive resources may also exist independently of data resources for
cataloging and organizational purposes. They have a few characteristics in
common:

- They are made up of one or more named graphs stored in the Lakesuperior
  back end. The URIs of these named graphs follow a specific naming pattern,
  and the graphs have explicit links between them, so they can be retrieved as
  one unit.
- They are normal RDF resources, therefore their content is parsable by
  Lakesuperior and may be queried.
- They may consist of several named graphs, each with a purpose defined by the
  software: library-managed data, user-provided data, versioning data, etc.

Partitioning a DESC-R into multiple graphs allows individual data sets to be
annotated, e.g. to establish provenance or versioning information about the
asserted facts. Future developments of `lsup_repo` or software built upon it
may take advantage of this structure.
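
As a rough illustration of how several named graphs could be retrieved as one
unit, the sketch below derives one graph URI per role from a resource URI and
merges their triples. This is not the `lsup_repo` API: the role names, the
`#role` suffix pattern, the `ex:` predicates and the dictionary-based store are
all assumptions made for the example.

```python
# Hypothetical roles and naming pattern; lsup_repo's actual pattern may differ.
GRAPH_ROLES = ("admin", "user", "version")


def graph_uris(resource_uri):
    """Derive one named-graph URI per role for a descriptive resource."""
    return {role: f"{resource_uri}#{role}" for role in GRAPH_ROLES}


def collect_resource(store, resource_uri):
    """Merge the triples of all graphs belonging to one resource."""
    triples = set()
    for uri in graph_uris(resource_uri).values():
        # A role with no stored graph simply contributes nothing.
        triples |= store.get(uri, set())
    return triples


# Toy store: named-graph URI -> set of (subject, predicate, object) triples.
store = {
    "urn:ex:doc1#admin": {("urn:ex:doc1", "ex:created", "2024-01-01")},
    "urn:ex:doc1#user": {("urn:ex:doc1", "ex:title", "Report")},
    "urn:ex:doc2#user": {("urn:ex:doc2", "ex:title", "Other")},
}

doc1 = collect_resource(store, "urn:ex:doc1")
```

Keeping library-managed and user-provided triples in separate graphs means each
graph can be annotated or replaced on its own while the resource is still read
as a whole.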

Triples in a DESC-R can have any subject; however it is recommended to maintain
some consistency about which subjects are treated in each resource.
Specifically, the use of a resource as an aggregation or container of triples
about multiple independent entities is discouraged in favor of the use of
dedicated data structures, as described below.

### Resource Structures

Descriptive resources can be organized in various aggregation forms. The
aggregating resources are normal descriptive resources, with specific
predicates pointing to other descriptive resources.

It is important to note that, unlike in Fedora or other LDP implementations,
the life cycle of resource aggregations is entirely independent of the
aggregated resources. In LDP, deleting a container would remove its contained
resources. Also, in LDP a resource can only be contained by a single container
(except in the case of indirect containers, to some extent). In Lakesuperior an
aggregation only "contains" pointers to other, entirely independent resources.
Any resource can be pointed to by an arbitrary number of aggregations, and any
aggregation can be removed at any time without changing the state of the
aggregated resources.

On the other hand, deleting a resource that is part of some structure causes a
scan of all inbound links (see "Referential Integrity" below) and the removal
of all links to it present in other structures; therefore, the deletion of an
aggregated resource changes the state of its aggregations.
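
The asymmetry described above can be sketched with a toy in-memory model. The
`Repo` class, its methods and the `urn:ex:` URIs are hypothetical and only
illustrate the behavior; they are not part of `lsup_repo`.

```python
class Repo:
    """Toy store: each resource URI maps to the set of URIs it links to."""

    def __init__(self):
        self.links = {}

    def create(self, uri, targets=()):
        self.links[uri] = set(targets)

    def delete(self, uri):
        # Remove the resource itself...
        del self.links[uri]
        # ...then scan all remaining resources and drop inbound links to it.
        for targets in self.links.values():
            targets.discard(uri)


repo = Repo()
repo.create("urn:ex:doc1")
repo.create("urn:ex:doc2")
repo.create("urn:ex:set_a", ["urn:ex:doc1", "urn:ex:doc2"])
repo.create("urn:ex:set_b", ["urn:ex:doc1"])

# Deleting an aggregation leaves its members untouched.
repo.delete("urn:ex:set_b")
doc1_survives = "urn:ex:doc1" in repo.links

# Deleting a member changes every aggregation that pointed to it.
repo.delete("urn:ex:doc1")
```

After the second deletion, `urn:ex:set_a` only points to `urn:ex:doc2`.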

The types of structures foreseen for the first implementation of `lsup_repo`
are:

#### Set

A set is simply a descriptive resource containing an unordered collection of
unique links to other descriptive resources. Any descriptive resource,
including other structures, can be a member. Shorthand functions for counting
and iterating over
Set members, as well as performing boolean operations on them, shall be made
available. As it is a descriptive resource, a Set may have descriptive metadata
added to it, such as taxonomy, descriptions, labels, etc.
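
The planned shorthand functions do not exist yet. As a conceptual sketch only,
the operations they are meant to cover map directly onto Python's built-in
`set` type when each Set is modeled as a collection of member URIs (the URIs
below are made up):

```python
# Two toy Sets, each modeled as a Python set of member URIs.
set_a = {"urn:ex:doc1", "urn:ex:doc2", "urn:ex:doc3"}
set_b = {"urn:ex:doc2", "urn:ex:doc3", "urn:ex:doc4"}

# Links are unique: adding an existing member does not change the Set.
set_a.add("urn:ex:doc1")
member_count = len(set_a)

# Boolean operations between Sets.
in_both = set_a & set_b      # intersection
in_either = set_a | set_b    # union
only_in_a = set_a - set_b    # difference

# Sets are unordered; sort explicitly when a stable iteration order is needed.
ordered_members = sorted(set_a)
```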

#### List

A Lakesuperior *List* is the implementation of a "Linked List" data structure.
It contains a link to a single descriptive resource. This resource, called a
*List Item*, represents the first item in the list. Each list item, except for
the last one, contains a single link to the next list item.

In addition, every list item has either:

- A link to the resource it stands for: the List Item is a proxy for an
  existing resource, which makes it possible to make the same resource part of
  multiple lists; or

- A link to another list, which results in a nested list.

Shorthand functions to perform common list operations shall be made available.
As with other descriptive resources, Lists and List Items can have any type of
user-defined metadata and relationships added.
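
The structure described above can be modeled in a few lines. This is a
conceptual sketch, not the `lsup_repo` implementation: the `RepoList` and
`ListItem` classes, the `flatten` helper and the URIs are names invented for
the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ListItem:
    # Either the URI of the resource this item stands for, or a nested list.
    target: Union[str, "RepoList"]
    # Link to the next item; None marks the last item in the list.
    next: Optional["ListItem"] = None


@dataclass
class RepoList:
    # Link to the first item; None means the list is empty.
    first: Optional[ListItem] = None


def flatten(lst):
    """Yield resource URIs in list order, descending into nested lists."""
    item = lst.first
    while item is not None:
        if isinstance(item.target, RepoList):
            yield from flatten(item.target)
        else:
            yield item.target
        item = item.next


# Inner list: b -> c
inner = RepoList(ListItem("urn:ex:b", ListItem("urn:ex:c")))
# Outer list: a -> (inner list) -> b
outer = RepoList(ListItem("urn:ex:a", ListItem(inner, ListItem("urn:ex:b"))))

members = list(flatten(outer))
```

Because each item is a proxy rather than the resource itself, `urn:ex:b` can
appear both in the inner and in the outer list.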

#### Proxy

A List Item is a special case of a *Proxy*, which is a descriptive resource
standing for another descriptive resource. This indirection is useful for
adding a specific context to a resource, e.g. additional information on a
document in the context of a curated collection that is only valid or relevant
to that collection. Proxies can be aggregated in sets or other structures as
well, as one sees fit.

Proxy definitions follow the [OAI ORE](http://www.openarchives.org/ore/)
ontology.
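
As an illustration, a proxy could be described with the `ore:Proxy`,
`ore:proxyFor` and `ore:proxyIn` terms of the ORE vocabulary. Which ORE terms
`lsup_repo` will actually use is not specified here, so the selection below is
an assumption; `make_proxy`, the `urn:ex:` URIs and the `curatorNote` predicate
are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORE = "http://www.openarchives.org/ore/terms/"


def make_proxy(proxy_uri, resource_uri, aggregation_uri, context=()):
    """Build the triples of a proxy standing for a resource in an aggregation.

    `context` is an iterable of (predicate, object) pairs that are only
    meaningful within the given aggregation.
    """
    triples = {
        (proxy_uri, RDF_TYPE, ORE + "Proxy"),
        (proxy_uri, ORE + "proxyFor", resource_uri),
        (proxy_uri, ORE + "proxyIn", aggregation_uri),
    }
    triples.update((proxy_uri, pred, obj) for pred, obj in context)
    return triples


proxy = make_proxy(
    "urn:ex:proxy1",
    "urn:ex:doc1",
    "urn:ex:collection1",
    context=[("urn:ex:curatorNote", "Shown in the opening section")],
)
```

The curator's note is attached to the proxy, so `urn:ex:doc1` itself stays
free of statements that only make sense inside `urn:ex:collection1`.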

### Referential Integrity

The concept of Linked Data, which `lsup_repo` is partly built upon, does not
guarantee that a link pointing to a resource resolves to an actual resource,
since it is often impossible to determine which system is responsible for
managing that resource, let alone to have any control over it. Therefore,
"broken links" cannot be ruled out.

`lsup_repo`, however, relies on the assumption that a specific set of resources
is under its full control, and therefore guarantees that all references to
internally managed resources are maintained at all times. This means that when
a resource is deleted, all links pointing to it are identified and removed.
This is called *referential integrity*.

Tools shall be developed to perform periodic referential integrity checks and
to report dangling links and/or repair them.
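
Such a check could look roughly like the following sketch. It assumes that
managed resources are recognizable by a URI prefix and that the repository can
be viewed as a mapping from each resource to its outbound links; both
assumptions, like the `find_dangling` name, are made up for the example.

```python
def find_dangling(links, managed_prefix):
    """Return (source, target) pairs whose managed target no longer exists.

    Links to URIs outside `managed_prefix` are ignored: the repository has no
    control over them, so they may legitimately be broken.
    """
    dangling = []
    for source, targets in links.items():
        for target in sorted(targets):
            if target.startswith(managed_prefix) and target not in links:
                dangling.append((source, target))
    return dangling


links = {
    "urn:repo:doc1": set(),
    "urn:repo:set_a": {
        "urn:repo:doc1",                # managed and present
        "urn:repo:gone",                # managed but missing: dangling
        "http://example.org/external",  # not managed: not checked
    },
}

report = find_dangling(links, "urn:repo:")
```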

### Managed resources

Some Lakesuperior resources (DESC-R) and RDF terms may be managed by the
repository and are handled in a special way under most circumstances.

Examples of such resources can be:

- Support structures [TODO specify]
- RDF predicates
- RDF types

Some managed resources may be handled by the user in different ways depending
on the state of a resource. For example, an RDF type of `lsup:List` can be
specified by the user on creation, but after that it may not be modified
manually.
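
The `lsup:List` rule could be enforced along these lines. The `Resource` class,
the `update_types` method and the choice of `PermissionError` are hypothetical
and only sketch the behavior; `ex:Document` is a made-up user-defined type.

```python
# `lsup:List` is taken from the text; the full set of managed types is not
# defined yet, so this constant is only a placeholder.
MANAGED_TYPES = {"lsup:List"}


class Resource:
    def __init__(self, uri, types=()):
        self.uri = uri
        # On creation the user may freely specify managed types.
        self.types = set(types)

    def update_types(self, new_types):
        """Replace the RDF types, refusing any change to a managed type."""
        new_types = set(new_types)
        # Types that would be added or removed by this update.
        changed = self.types ^ new_types
        blocked = changed & MANAGED_TYPES
        if blocked:
            raise PermissionError(
                f"managed types cannot be modified: {sorted(blocked)}"
            )
        self.types = new_types


res = Resource("urn:ex:list1", types=["lsup:List"])
# Adding a user-defined type while keeping the managed one is allowed.
res.update_types(["lsup:List", "ex:Document"])
```

Attempting `res.update_types(["ex:Document"])` afterwards would raise
`PermissionError`, because it would remove the managed `lsup:List` type.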

TODO A more detailed list of these managed resources and their behaviour will
be included in an expanded version of this documentation.
